Introduction
America is a highly diverse country: diverse not only in ethnicity, but also in income, industry, and law. This diversity opens the door to a variety of possible interactions between these variables. What factors drive the way income is distributed in the United States? What factors reliably predict whether the average income per capita in a specific area is high or low? How do state-level variations in law and freedom impact income?
How are these questions SMART?
These questions are important because they tell many facets of the story of consumption in the United States. Income serves as a measure both of productivity and of lifetime consumption (although this analysis does not disentangle the two). Although their scope is broad, the questions remain specific to the concepts of income, demographics, and freedom, and maintain a consistent structure: how do demographics and freedom drive income in the United States at the census-tract level?
These questions also correspond to a set of highly measurable (and, luckily, pre-measured) variables. Income can be imputed from tax records, while ethnicity and work status are available from census forms. Answering the questions is made simpler by the cleanliness and availability of this data; since few data points are missing across the census tracts of the 50 states of interest, it is straightforward to form statistical tests.
Finally, these questions are relevant to policymakers who want to improve the incomes of their constituents, as well as to researchers interested in establishing a baseline for the average income a community should be expected to earn given its demographics. These are critical questions, because the ability of communities to support themselves economically has massive impacts on the wellbeing of their members.
Content
First, the structure of the U.S. Census Bureau database is examined, along with the variables it includes. Second, the groups of independent variables are presented, together with how each could affect the income per capita of a community. Then, an exploratory data analysis and several statistical tests evaluate the significance of the variables. Next, the models are executed, including clustering, classification, and regression. Finally, a conclusion considers the further challenges and questions needed to enhance future analyses.
Dataset
U.S. Census Bureau Dataset
The U.S. Census Bureau conducts the yearly American Community Survey: a project that asks Americans around the country about several dimensions of their lives, including work, income, demographics, and other activities (U.S. Census Bureau, 2019). The 2015 dataset was available via Kaggle (MuonNeutrino, 2015) and included more than 74,000 observations across 37 columns (variables). The dataset includes two variables related to income: median household income and income per capita. Income per capita was preferred because it adjusts per person rather than per household, given that the number of people living in an average household is unknown. The variable income per capita (IncomePerCap) is calculated as the average income per person in the population of a specific census tract. But what is a census tract, and why use them?
Census tracts
Household income in America varies significantly by geographic location. The richest counties in the country are concentrated in urban areas near big metropolises where most businesses are located; the Bay Area in northern California, northeastern Virginia, and New York are some examples. However, counties are an inconsistent unit of comparison. There are 3,142 counties in a country of 300 million inhabitants (U.S. Census Bureau, 2019), but they vary widely. Texas, for example, has 254 counties (U.S. Census Bureau, 2017). California, a state with approximately 10 million more people than Texas, has only 58 counties (U.S. Census Bureau, 2017). Population-wise, California has the largest county in the country, with more than 10 million inhabitants (Los Angeles), whereas Texas has more than 80 counties with fewer than 10,000 people (U.S. Census Bureau, 2017). Density-wise, New York has four of the five densest counties in the country, some of them 60,000 times denser than counties in Hawaii, Alaska, or Nevada (U.S. Census Bureau, 2013). In response to these inconsistencies, the U.S. Census Bureau began delineating “census tracts” at the beginning of the twentieth century. A census tract is “a geographic region defined for the purpose of taking a census.” Over the years, the U.S. Census Bureau has established census tracts in every county in America. There are over 74,000 census tracts in the country, and a typical one has around 4,000 residents. This consistency is a strength: census tracts are by and large similar in population size, and that size does not vary much from state to state.
Description of Variables
The working dataset includes 12 independent variables and 1 dependent variable. Based on their nature, the independent variables were classified into two groups: Work Variation and Ethnic Variation.
Work Variation:
Professional: Percentage (%) employed in management, business, science, and arts in a census tract.
Service: Percentage (%) employed in service jobs in a census tract.
Office: Percentage (%) employed in sales and office jobs in a census tract.
Construction: Percentage (%) employed in natural resources, construction, and maintenance in a census tract.
Production: Percentage (%) employed in production, transportation, and material movement in a census tract.
Unemployed: Unemployment rate (%) in a census tract.
Self-employed: Percentage (%) self-employed in a census tract.
Ethnic Variation:
Native: Percentage (%) of population that is Native American or Native Alaskan in a census tract.
White: Percentage (%) of population that is white in a census tract.
Black: Percentage (%) of population that is black in a census tract.
Hispanic: Percentage (%) of population that is Hispanic/Latino in a census tract.
Asian: Percentage (%) of population that is Asian in a census tract.
EDA
The exploratory data analysis of this dataset was divided into 4 sections:
- Population of census tracts
- Income per capita
- Independent variables
- Correlation matrix
Income per capita
Min. 1st Qu. Median Mean 3rd Qu. Max.
128 18776 24730 26140 32247 56040

Outliers and NA’s were removed from the dependent variable, income per capita. After removal, the histogram appears normal, the tails of the Q-Q plot lie closer to the line, and the mean and median are much closer together. Outliers were removed because super-wealthy individuals could exert a significant pull on the distribution.
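The cleaning step can be sketched as follows. The report does not name the exact outlier rule used, so the common 1.5 × IQR fences are assumed here, shown on a few toy income values for self-containment.

```r
# Toy income-per-capita values; the last two are an extreme outlier and an NA
income <- c(128, 18776, 24730, 26140, 32247, 56040, 250000, NA)

income <- income[!is.na(income)]            # drop NA's first

# Assumed rule: drop values beyond 1.5 * IQR from the quartiles
q <- quantile(income, c(0.25, 0.75))
iqr <- q[2] - q[1]
keep <- income >= q[1] - 1.5 * iqr & income <= q[2] + 1.5 * iqr
income_clean <- income[keep]
```

After this screen, the extreme value is gone while the bulk of the distribution, including the maximum reported above, survives.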
Independent variables
Work variation
Next the seven variables for work variations (professional, production, unemployment, office, service, construction, self-employed) were assessed for normality.
Factor w/ 4 levels "[0,5.3]","(5.3,7.9]",..: 2 4 2 3 1 3 3 3 3 2 ...

Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
0.000 5.300 7.900 9.251 11.600 100.000 101
The work variations whose boxplots exhibited a decrease in income as their share of the census tract increased were unemployment, service, construction, and production. That is to say, as more unemployed individuals were counted in a given census tract, income per capita decreased.
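The quartile-group boxplots above can be reproduced with base R alone; the snippet below uses simulated unemployment and income values (the negative slope is built in for illustration) rather than the actual census data.

```r
# Simulated stand-ins for one work variation and income per capita
set.seed(1)
unemployment <- pmin(pmax(rnorm(1000, mean = 9.3, sd = 5), 0), 100)
income <- 30000 - 800 * unemployment + rnorm(1000, sd = 3000)

# Bin the work variable at its quartile breaks, mirroring the
# 4-level factors printed in the output above
groups <- cut(unemployment,
              breaks = quantile(unemployment, probs = seq(0, 1, 0.25)),
              include.lowest = TRUE)

boxplot(income ~ groups,
        xlab = "Unemployment quartile group",
        ylab = "Income per capita")
```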
Factor w/ 4 levels "[0,23.7]","(23.7,31.7]",..: 3 1 2 2 4 2 1 4 2 2 ...

Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
0.00 23.70 31.70 33.23 41.80 100.00 105


The only work variation that exhibited an increase in average income was professional work. Looking at the histograms of each variable, only the proportion of professionals appeared normally distributed. For professionals, the Q-Q plot affirmed normality: the points did not stray far from the line, with very small right and left tails. The same cannot be said for the other variables, each of which had an oversized right tail and a relatively small left tail. Overall, the proportion of professionals appeared normally distributed while the other work variations did not.





The remaining variables, office and self-employed, stayed relatively stable across quartiles. Apart from professional work, the six other work variations were all skewed to the right.
Ethnic variation
Finally the five ethnic variables (Native, White, Black, Hispanic, and Asian) were investigated.
Factor w/ 4 levels "[0,37.1]","(37.1,70.3]",..: 3 2 3 3 2 3 3 3 4 3 ...

Min. 1st Qu. Median Mean 3rd Qu. Max.
0.00 37.10 70.30 61.24 88.40 100.00

The boxplots for White showed an increase in average income between the first, second, and third quartiles, but no change in the fourth.
Factor w/ 4 levels "[0,0.1]","(0.1,1.2]",..: 2 3 3 1 3 1 1 1 1 2 ...

Min. 1st Qu. Median Mean 3rd Qu. Max.
0.000 0.100 1.200 4.347 4.400 91.300
The boxplot for Asian showed an increase from the first through the fourth quartile.
Factor w/ 4 levels "[0,2.4]","(2.4,7.2]",..: 1 1 1 3 1 3 2 1 1 1 ...

Min. 1st Qu. Median Mean 3rd Qu. Max.
0.00 2.40 7.20 17.36 21.50 100.00

The boxplots for Hispanic increased slightly between the first and second quartiles but did not change for the third. The fourth quartile for Hispanic decreased significantly.
Factor w/ 4 levels "[0,0.8]","(0.8,4]",..: 3 4 4 2 4 3 4 3 3 3 ...

Min. 1st Qu. Median Mean 3rd Qu. Max.
0.00 0.80 4.00 13.78 15.32 100.00

The boxplot for Black increased in average income between the first and second quartile. Then there was a decrease in average income from the second to the fourth quartiles. Overall, it appeared that average income did change based on concentration of ethnicities in a census tract.
The histogram for White was bimodal, with the highest frequency at over 8,000. The histograms for the other four ethnicities were skewed to the right. Based on the histograms, White had the highest responses, followed by Hispanic, Black, Asian, and Native.
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.0000 0.0000 0.0000 0.7567 0.4000 100.0000

Also, there were not enough responses from the Native ethnicity to construct a meaningful boxplot. Therefore, based on the assessment of the boxplots, histograms, and Q-Q plots, none of the ethnicities appear normally distributed.
Correlation Matrix

PCA
A Principal Component Analysis (PCA) and Principal Component Regression (PCR) seemed suited to this dataset. The purpose of this technique is to decrease the number of variables while accounting for collinearity. Within this dataset there were 10 variables available to explain IncomePerCap. However, the correlation matrix showed notable correlation between some of the predictor variables: for example, Professional correlated notably with Service, Construction, Production, and Unemployment, and White correlated notably with Hispanic and Black. From this initial overview of the correlation matrix, PCA seemed suitable and was pursued.
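The PCA step itself can be sketched with base R's prcomp(). The snippet below runs on simulated columns (so it is self-contained); in the actual analysis the data frame would hold the ten cleaned census-tract predictors named here.

```r
# Simulated stand-in for the ten census-tract predictors
set.seed(1)
dat <- as.data.frame(matrix(rnorm(1000 * 10), ncol = 10))
names(dat) <- c("Hispanic", "White", "Black", "Asian", "Professional",
                "Service", "Office", "Construction", "Production",
                "Unemployment")

# Center and scale before extracting components, since the
# predictors are on different percentage scales
pca <- prcomp(dat, center = TRUE, scale. = TRUE)

summary(pca)   # standard deviation / proportion-of-variance table
biplot(pca)    # PC1 vs PC2 with variable loadings overlaid
```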

The biplot above plots over 70,000 data points, resulting in a dense scattering. The horizontal axis is PC1 and the vertical axis is PC2. PC1 had the most variation, ranging from approximately -5 to 10, while PC2 ranged from -8 to 10. The variables White, Production, Unemployment, Black, and Professional were fairly evenly split between PC1 and PC2. Other variables, such as Office, Service, Unemployed, and Construction, were represented mainly in either PC1 or PC2.
Importance of components:
PC1 PC2 PC3 PC4 PC5 PC6 PC7
Standard deviation 1.7878 1.3389 1.1653 1.0355 0.88819 0.82267 0.76933
Proportion of Variance 0.3196 0.1792 0.1358 0.1072 0.07889 0.06768 0.05919
Cumulative Proportion 0.3196 0.4989 0.6347 0.7419 0.82078 0.88845 0.94764
PC8 PC9 PC10
Standard deviation 0.71304 0.12303 0.003342
Proportion of Variance 0.05084 0.00151 0.000000
Cumulative Proportion 0.99849 1.00000 1.000000
The breakdown of the variation explained by each component showed that just over 60% of the variation was accounted for within the first three components. However, except for the first component, the change in the amount of variation explained in each consecutive component was similar. This was further illustrated by the following graph.

Call:
lm(formula = IncomePerCap ~ ., data = pcadata_pcr_rot)
Residuals:
Min 1Q Median 3Q Max
-57889 -3154 -136 3093 39355
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 26167.82 20.93 1250.463 < 2e-16 ***
PC1 -4585.05 11.71 -391.701 < 2e-16 ***
PC2 -1454.29 15.63 -93.043 < 2e-16 ***
PC3 604.54 17.96 33.664 < 2e-16 ***
PC4 994.55 20.21 49.214 < 2e-16 ***
PC5 -878.20 23.56 -37.274 < 2e-16 ***
PC6 1377.18 25.44 54.140 < 2e-16 ***
PC7 -205.74 27.20 -7.564 3.96e-14 ***
PC8 -196.99 29.35 -6.712 1.93e-11 ***
PC9 3301.06 170.10 19.407 < 2e-16 ***
PC10 -3519.28 6262.21 -0.562 0.574
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 5519 on 69556 degrees of freedom
Multiple R-squared: 0.7102, Adjusted R-squared: 0.7101
F-statistic: 1.704e+04 on 10 and 69556 DF, p-value: < 2.2e-16
A full principal component regression was performed, and all components except the last were deemed significant. Notably, this regression explained 71.01% of the variance in IncomePerCap according to the adjusted R-squared. The strongest component was PC1, with a t-value whose magnitude was by far larger than those of the other components.
A plot of the R-squared values over the number of components showed the amount of variation in the dependent variable, IncomePerCap, explained by the components. The steep initial increase followed by a leveling off in the R-squared graph indicated that a significant amount of the variation in IncomePerCap was explained by the first component alone. Based on this graph and the results of the regression, it seemed appropriate to run a regression on just PC1, which resulted in a lower adjusted R-squared.
Call:
lm(formula = IncomePerCap ~ PC1, data = pcadata_pcr_rot)
Residuals:
Min 1Q Median 3Q Max
-42965 -4026 -442 3546 36606
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 26167.82 23.34 1121.0 <2e-16 ***
PC1 -4585.05 13.06 -351.1 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 6157 on 69565 degrees of freedom
Multiple R-squared: 0.6393, Adjusted R-squared: 0.6393
F-statistic: 1.233e+05 on 1 and 69565 DF, p-value: < 2.2e-16
The regression on PC1 alone confirmed the smaller R-squared, at 63.93%. The tradeoff between parsimony and explanatory power made the choice between the two models unclear. If the more explanatory model were chosen, as favored by adjusted R-squared (which accounts for the number of components included), only one component would be removed. That is not an effective parsing down of variables; however, bias would remain low since only one component was dropped.
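The PC1-only fit can be sketched as below. To keep the snippet self-contained, the predictors are simulated around a shared latent factor (so PC1 carries real signal, as it does in the census data); the coefficient values are illustrative, not the report's estimates.

```r
# Ten correlated predictors built from one latent factor, plus an
# income variable that depends on that factor
set.seed(2)
latent <- rnorm(500)
x <- as.data.frame(sapply(1:10, function(j) latent + rnorm(500, sd = 0.7)))
y <- 26000 - 4500 * latent + rnorm(500, sd = 3000)

# Extract component scores and regress income on the first one
pca <- prcomp(x, scale. = TRUE)
scores <- as.data.frame(pca$x)
scores$IncomePerCap <- y

fit_pc1 <- lm(IncomePerCap ~ PC1, data = scores)
summary(fit_pc1)$adj.r.squared
```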
K-Means
List of 9
$ cluster : Named int [1:69567] 2 2 2 2 2 1 2 1 2 2 ...
..- attr(*, "names")= chr [1:69567] "1" "2" "3" "4" ...
$ centers : num [1:2, 1:11] -0.329 0.174 0.42 -0.222 -0.32 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : chr [1:2] "1" "2"
.. ..$ : chr [1:11] "Hispanic" "White" "Black" "Asian" ...
$ totss : num 7.31e+12
$ withinss : num [1:2] 1.15e+12 1.34e+12
$ tot.withinss: num 2.5e+12
$ betweenss : num 4.81e+12
$ size : int [1:2] 24086 45481
$ iter : int 1
$ ifault : int 0
- attr(*, "class")= chr "kmeans"
K-means clustering with 2 clusters of sizes 24086, 45481
Cluster means:
Hispanic White Black Asian Professional Service
1 -0.3285506 0.4200874 -0.3199683 0.2372900 0.9282858 -0.6150829
2 0.1739950 -0.2224715 0.1694500 -0.1256649 -0.4916051 0.3257379
Office Construction Production Unemployment IncomePerCap
1 -0.01890396 -0.4302842 -0.6556146 -0.5268802 37598.33
2 0.01001123 0.2278715 0.3472029 0.2790272 20114.40
Clustering vector:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26
2 2 2 2 2 1 2 1 2 2 2 2 2 2 2 1 2 2 1 1 2 2 1 2 2 2
27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 45 46 47 48 49 50 51 52 53
2 2 1 1 1 1 1 2 1 1 2 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2
54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76
2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
[ reached getOption("max.print") -- omitted 69492 entries ]
Within cluster sum of squares by cluster:
[1] 1.153649e+12 1.344211e+12
(between_SS / total_SS = 65.8 %)
Available components:
[1] "cluster" "centers" "totss" "withinss" "tot.withinss"
[6] "betweenss" "size" "iter" "ifault"

List of 9
$ cluster : Named int [1:69567] 2 1 1 2 2 2 1 2 2 2 ...
..- attr(*, "names")= chr [1:69567] "1" "2" "3" "4" ...
$ centers : num [1:3, 1:11] 0.432 -0.24 -0.363 -0.575 0.34 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : chr [1:3] "1" "2" "3"
.. ..$ : chr [1:11] "Hispanic" "White" "Black" "Asian" ...
$ totss : num 7.31e+12
$ withinss : num [1:3] 4.33e+11 3.94e+11 3.93e+11
$ tot.withinss: num 1.22e+12
$ betweenss : num 6.09e+12
$ size : int [1:3] 27175 29638 12754
$ iter : int 2
$ ifault : int 0
- attr(*, "class")= chr "kmeans"
K-means clustering with 3 clusters of sizes 27175, 29638, 12754
Cluster means:
Hispanic White Black Asian Professional Service
1 0.4318377 -0.5751979 0.3952287 -0.15378102 -0.7689991 0.6127621
2 -0.2397814 0.3399978 -0.2076592 -0.02410908 0.1329131 -0.2111302
3 -0.3629095 0.4354829 -0.3595529 0.38368702 1.3296436 -0.8149861
Office Construction Production Unemployment IncomePerCap
1 -0.02605785 0.2974405 0.51204618 0.6194822 16576.79
2 0.07327428 0.0108065 -0.07894429 -0.3101802 27821.96
3 -0.11475466 -0.6588701 -0.90756657 -0.5991303 42759.54
Clustering vector:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26
2 1 1 2 2 2 1 2 2 2 2 1 1 1 2 2 2 1 3 3 2 2 2 1 2 2
27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 45 46 47 48 49 50 51 52 53
1 1 2 2 3 2 3 1 3 3 2 2 2 2 1 1 2 1 1 1 1 1 1 1 1 1
54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76
1 1 2 1 1 1 1 1 1 1 1 2 1 1 1 1 1 2 1 1 1 1 1
[ reached getOption("max.print") -- omitted 69492 entries ]
Within cluster sum of squares by cluster:
[1] 433025278355 393691127581 392888818743
(between_SS / total_SS = 83.3 %)
Available components:
[1] "cluster" "centers" "totss" "withinss" "tot.withinss"
[6] "betweenss" "size" "iter" "ifault"

List of 9
$ cluster : Named int [1:69567] 4 2 4 4 4 1 4 1 4 4 ...
..- attr(*, "names")= chr [1:69567] "1" "2" "3" "4" ...
$ centers : num [1:4, 1:11] -0.302 0.668 -0.374 -0.132 0.407 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : chr [1:4] "1" "2" "3" "4"
.. ..$ : chr [1:11] "Hispanic" "White" "Black" "Asian" ...
$ totss : num 7.31e+12
$ withinss : num [1:4] 1.76e+11 1.92e+11 1.73e+11 1.75e+11
$ tot.withinss: num 7.15e+11
$ betweenss : num 6.6e+12
$ size : int [1:4] 17574 17698 8266 26029
$ iter : int 2
$ ifault : int 0
- attr(*, "class")= chr "kmeans"
K-means clustering with 4 clusters of sizes 17574, 17698, 8266, 26029
Cluster means:
Hispanic White Black Asian Professional Service
1 -0.3016974 0.4066104 -0.28173021 0.1074823 0.5711741 -0.43512866
2 0.6680011 -0.8809881 0.57590840 -0.1651807 -0.9293827 0.83254425
3 -0.3738848 0.4389987 -0.37976189 0.4559152 1.5331982 -0.92442949
4 -0.1317654 0.1850702 -0.08076331 -0.1050413 -0.2406168 0.02128076
Office Construction Production Unemployment IncomePerCap
1 0.06629532 -0.2290380 -0.4319137 -0.46098899 32840.09
2 -0.05484069 0.3137655 0.5749681 0.89883620 14433.73
3 -0.17952039 -0.7693828 -1.0184110 -0.62970642 45780.77
4 0.04953752 0.1856318 0.2240905 -0.09992813 23412.84
Clustering vector:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26
4 2 4 4 4 1 4 1 4 4 4 4 4 2 4 1 4 2 1 1 4 4 1 2 4 4
27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 45 46 47 48 49 50 51 52 53
2 4 1 1 3 1 1 4 1 3 4 1 1 4 4 4 4 4 2 2 2 2 2 4 4 2
54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76
2 4 4 2 4 4 4 2 2 2 4 4 4 2 2 4 4 4 2 4 2 4 4
[ reached getOption("max.print") -- omitted 69492 entries ]
Within cluster sum of squares by cluster:
[1] 175536566241 191571930523 172553341760 175374034150
(between_SS / total_SS = 90.2 %)
Available components:
[1] "cluster" "centers" "totss" "withinss" "tot.withinss"
[6] "betweenss" "size" "iter" "ifault"

List of 9
$ cluster : Named int [1:69567] 1 8 10 1 5 5 10 6 1 1 ...
..- attr(*, "names")= chr [1:69567] "1" "2" "3" "4" ...
$ centers : num [1:10, 1:11] -0.219 1.288 -0.361 -0.401 -0.269 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : chr [1:10] "1" "2" "3" "4" ...
.. ..$ : chr [1:11] "Hispanic" "White" "Black" "Asian" ...
$ totss : num 7.31e+12
$ withinss : num [1:10] 1.12e+10 1.74e+10 1.39e+10 1.27e+10 1.15e+10 ...
$ tot.withinss: num 1.28e+11
$ betweenss : num 7.18e+12
$ size : int [1:10] 10068 3865 3958 2488 8688 6934 7565 9957 5277 10767
$ iter : int 2
$ ifault : int 0
- attr(*, "class")= chr "kmeans"
K-means clustering with 10 clusters of sizes 10068, 3865, 3958, 2488, 8688, 6934, 7565, 9957, 5277, 10767
Cluster means:
Hispanic White Black Asian Professional Service
1 -0.21894266 0.3228535 -0.18234201 -0.087339706 -0.06461379 -0.11255017
2 1.28797955 -1.3954285 0.66203610 -0.227681863 -1.10820404 1.22557254
3 -0.36067127 0.4329611 -0.37594612 0.433931771 1.47068572 -0.88529754
4 -0.40143948 0.4574908 -0.41624907 0.552865396 1.86667764 -1.11560010
5 -0.26857421 0.3814098 -0.24191694 0.001449229 0.27733630 -0.29947970
6 -0.31091712 0.4238401 -0.30526537 0.133107865 0.63900706 -0.46799362
Office Construction Production Unemployment IncomePerCap
1 0.066108550 0.12256875 0.08790777 -0.24792612 25611.181
2 -0.076791743 0.32434445 0.48520322 1.65358131 9524.098
3 -0.150092261 -0.74228280 -0.99238368 -0.63110451 44529.439
4 -0.323599903 -0.93488726 -1.16996086 -0.66198939 51688.147
5 0.094878659 -0.05916413 -0.20555508 -0.38831757 29399.124
6 0.075622849 -0.27384931 -0.49047224 -0.48675137 33668.105
[ reached getOption("max.print") -- omitted 4 rows ]
Clustering vector:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26
1 8 10 1 5 5 10 6 1 1 10 10 8 8 1 5 10 7 6 9 1 1 5 8 10 10
27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 45 46 47 48 49 50 51 52 53
8 8 5 5 9 6 6 10 6 3 10 5 6 10 8 8 1 8 7 8 7 8 7 8 10 8
54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76
8 8 10 7 8 10 10 8 8 8 8 1 8 7 8 10 8 1 8 10 7 8 10
[ reached getOption("max.print") -- omitted 69492 entries ]
Within cluster sum of squares by cluster:
[1] 11229084259 17370175686 13910959117 12674139695 11527257287 11975510243
[7] 11954310598 12586251071 12940548385 11846270069
(between_SS / total_SS = 98.2 %)
Available components:
[1] "cluster" "centers" "totss" "withinss" "tot.withinss"
[6] "betweenss" "size" "iter" "ifault"

K-means is an unsupervised learning algorithm whose goal is to find groups, or clusters, in the data in order to identify patterns. All of the values in the dataset were normalized so that comparisons across the dataset were made on a similar scale. K-means was run with 2, 3, 4, and 10 clusters.
On inspection of the clusters created with k = 2, the cluster with the highest IncomePerCap (37,598) also had the highest cluster means for Professional (0.928), White (0.420), and Asian (0.237). The cluster plot shows all 70,000 data points, with the two clusters in blue and red. The clusters appear to overlap, but this is an artifact of projecting all the data points onto a two-dimensional graph. With two clusters, about 65.8% of the cluster sum of squares is captured.
A model with k = 3 was inspected next. The cluster with the highest IncomePerCap was cluster three, at 42,760; this cluster also had the highest cluster means for Professional (1.330) and Asian (0.384). The first cluster, with an IncomePerCap mean of 16,577, had the highest Unemployment average (0.619). The cluster plot shows three distinct clusters, though their overlap makes it somewhat difficult to tell which is which. With three clusters, 83.3% of the variation is captured, a drastic improvement over two clusters.
A final analysis used k = 4. The cluster with the highest IncomePerCap was cluster three, at 45,781; it had the highest Professional average (1.533) and the highest Asian average (0.456). The cluster with the lowest IncomePerCap was cluster two, at 14,434, which also had the highest Unemployment average (0.899). The cluster plot is difficult to interpret, as all the data points are projected onto two dimensions with four clusters drawn. With four clusters, 90.2% of the variation is captured. As the number of clusters increased from 4 to 10, the percentage captured did not increase drastically; at k = 10 it reached 98.2%. Therefore, four clusters would be sufficient.
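The runs above can be sketched in a few lines with base R's kmeans(), comparing the between_SS / total_SS ratio reported in the output for each k. Simulated data keeps the snippet self-contained; the eleven columns stand in for the ten standardized predictors plus IncomePerCap.

```r
# Simulated stand-in for the normalized census-tract matrix
set.seed(3)
dat_scaled <- scale(matrix(rnorm(2000 * 11), ncol = 11))

# between_SS / total_SS for k = 2, 3, 4, 10, as compared in the text
ratios <- sapply(c(2, 3, 4, 10), function(k) {
  km <- kmeans(dat_scaled, centers = k, nstart = 25)
  km$betweenss / km$totss
})
round(ratios, 3)
```

Since this ratio always rises with k, the elbow-style comparison in the text (large gains up to k = 4, small gains afterwards) is the right way to pick a parsimonious k.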
KNN
Preprocessing KNN
Besides prediction, classification was another method performed in order to understand the income per capita of census tracts. Is it possible to classify a future census tract’s income per capita based on the information given about work and ethnic variations? To answer this question, the dependent variable was transformed into a classifiable variable. In other words, income per capita, a numerical variable, was split into four groups: low, medium-low, medium-high, and high income per capita.
The function quant_groups was used. It separated the variable income per capita into quartiles and created a categorical variable with 4 categories. Each category has the same number of observations (17,418) and contains all census tracts whose income per capita falls within that quartile. The new categorical variable was named “ipc4” and included:
- Low: 0 - 18,800
- Mid-Low: 18,800 - 24,700
- Mid-High: 24,700 - 32,200
- High: 32,200 - 56,000
Separately, another variable was created from income per capita with only 2 categories, in order to compare how classifying income per capita into 4 categories differs from classifying it into 2. This second categorical variable was named “ipc2” and contained:
- Low: 0 - 24,700
- High: 24,700 - 56,000
Both new categorical variables were added to the dataset “dat_1”.
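The construction of ipc4 and ipc2 can be sketched with base R's cut() on quantile breaks, which behaves like the quant_groups helper used in the analysis. Simulated incomes make the snippet self-contained.

```r
# Simulated income-per-capita values in place of the census column
set.seed(4)
income <- runif(69567, min = 0, max = 56000)

# Four quartile-based categories (equal-sized groups)
ipc4 <- cut(income,
            breaks = quantile(income, probs = seq(0, 1, 0.25)),
            labels = c("Low", "Mid-Low", "Mid-High", "High"),
            include.lowest = TRUE)

# Two categories split at the median
ipc2 <- cut(income,
            breaks = quantile(income, probs = c(0, 0.5, 1)),
            labels = c("Low", "High"),
            include.lowest = TRUE)

table(ipc4)   # group sizes differ by at most one observation
```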
'data.frame': 69567 obs. of 12 variables:
$ Hispanic : num 0.9 0.8 0 10.5 0.7 13.1 3.8 1.3 1.4 0.4 ...
$ White : num 87.4 40.4 74.5 82.8 68.5 72.9 74.5 84 89.5 85.5 ...
$ Black : num 7.7 53.3 18.6 3.7 24.8 11.9 19.7 10.7 8.4 12.1 ...
$ Asian : num 0.6 2.3 1.4 0 3.8 0 0 0 0 0.3 ...
$ Professional: num 34.7 22.3 31.4 27 49.6 24.2 19.5 42.8 31.5 29.3 ...
$ Service : num 17 24.7 24.9 20.8 14.2 17.5 29.6 10.7 17.5 13.7 ...
$ Office : num 21.3 21.5 22.1 27 18.2 35.4 25.3 34.2 26.1 17.7 ...
$ Construction: num 11.9 9.4 9.2 8.7 2.1 7.9 10.1 5.5 7.8 11 ...
$ Production : num 15.2 22 12.4 16.4 15.8 14.9 15.5 6.8 17.1 28.3 ...
$ Unemployment: num 5.4 13.3 6.2 10.8 4.2 10.9 11.4 8.2 8.7 7.2 ...
$ ipc2 : Factor w/ 2 levels "Low","High": 2 1 1 1 2 2 1 2 1 1 ...
$ ipc4 : Factor w/ 4 levels "Low","Mid-Low",..: 3 1 2 2 3 3 2 4 2 2 ...
The dataset that was used to perform the KNN contained “ipc2” and “ipc4” along with the independent variables used to classify income per capita.
Train-Test split 4:1
First, the data was split into an 80% training subset and a 20% test subset. Two test sets were created: one for the categorical variable with 2 categories (ipc2) and another for the categorical variable with 4 categories (ipc4).
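A simple random 80/20 partition is assumed here (the exact sampling call is not shown in the report); dat_1 is mocked with two columns so the sketch runs on its own.

```r
# Random 80% / 20% train-test partition
set.seed(5)
n <- 69567
train_idx <- sample(n, size = round(0.8 * n))

# Stand-in for the dat_1 data frame described above
dat_1 <- data.frame(x = rnorm(n),
                    ipc2 = factor(sample(c("Low", "High"), n, replace = TRUE)))

train <- dat_1[train_idx, ]
test  <- dat_1[-train_idx, ]
```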
KNN 2 categories
Selecting the correct “k”
The first KNN model tried to classify census tracts’ income into two categories: Low or High. First, the chooseK() function was applied to determine the best value of k for the model.
num [1:2, 1:15] 1 0.204 3 0.177 5 ...

The graph shows that 9 is approximately the best value for k: it provides the highest accuracy, and accuracy does not improve substantially as more neighbors are added. In other words, the optimal number of neighbors used to classify each observation is 9.
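The chooseK() helper essentially wraps a loop like the one below over class::knn(), recording test accuracy for each odd k. The two-dimensional toy data here is an assumption for self-containment, not the census predictors.

```r
library(class)   # knn() lives in the recommended 'class' package

# Toy two-class problem: label depends on the sum of two features
set.seed(6)
x <- matrix(rnorm(600 * 2), ncol = 2)
y <- factor(ifelse(x[, 1] + x[, 2] > 0, "High", "Low"))
train_idx <- sample(600, 480)

# Scan odd k values and record test-set accuracy for each
ks <- seq(1, 29, by = 2)
accs <- sapply(ks, function(k) {
  pred <- knn(x[train_idx, ], x[-train_idx, ], y[train_idx], k = k)
  mean(pred == y[-train_idx])
})

plot(ks, accs, type = "b", xlab = "k", ylab = "Accuracy")
```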
Results
dat_ipc2.testLabels
dat_pred_ipc2 Low High
High 1940 9719
Low 9492 1685
[1] 0.1587406
As shown in the confusion matrix, the model can classify the income level of a census tract based on its work and ethnic demographics with an accuracy of 84% (the value printed above, 0.1587, is the misclassification rate). For the test set, that corresponded to 19,211 observations predicted successfully out of 22,836. However, this high accuracy corresponded only to the classification of two categories: low and high.
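The accuracy figure follows directly from the confusion matrix printed above: correct predictions sit on the diagonal.

```r
# The 2-class confusion matrix from the output above
cm <- matrix(c(9492, 1940, 1685, 9719), nrow = 2,
             dimnames = list(pred   = c("Low", "High"),
                             actual = c("Low", "High")))

# Accuracy = diagonal (correct) cells over all test observations
accuracy <- sum(diag(cm)) / sum(cm)
round(accuracy, 4)   # ~0.8413, i.e. a misclassification rate of ~0.1587
```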
The reality of income in the U.S. is much more complex than that. Therefore, it makes sense to perform another model trying to classify 4 categories instead.
KNN 4 categories
Selecting the correct “k”
The next KNN model tried to classify census tracts’ income into four categories: Low, Mid-Low, Mid-High, and High.
num [1:2, 1:15] 1 0.563 3 0.597 5 ...

The chooseK() function, applied to determine the best value of k for this model, again shows that 9 is approximately the best value, because it provides the highest accuracy.
Results
dat_ipc4.testLabels
dat_pred_ipc4 Low Mid-Low Mid-High High
Low 4120 982 165 19
Mid-Low 1299 2996 1422 169
Mid-High 238 1510 2923 1089
High 66 221 1223 4394
[1] 0.6320284
As shown in the confusion matrix, the accuracy score decreased from 84% to 63%. That means it is harder to classify income per capita with 4 categories than with 2. For the test set, that corresponded to 14,433 observations predicted successfully out of 22,836.
In reality, income per capita is a numerical variable, so the classification exercise served only as a proxy for understanding how census tracts can be classified based on their demographics; it is not by any means a completely reliable tool for predicting income in the U.S.
Linear Regression
'data.frame': 69672 obs. of 11 variables:
$ Hispanic : num 0.9 0.8 0 10.5 0.7 13.1 3.8 1.3 1.4 0.4 ...
$ White : num 87.4 40.4 74.5 82.8 68.5 72.9 74.5 84 89.5 85.5 ...
$ Black : num 7.7 53.3 18.6 3.7 24.8 11.9 19.7 10.7 8.4 12.1 ...
$ Asian : num 0.6 2.3 1.4 0 3.8 0 0 0 0 0.3 ...
$ Professional: num 34.7 22.3 31.4 27 49.6 24.2 19.5 42.8 31.5 29.3 ...
$ Service : num 17 24.7 24.9 20.8 14.2 17.5 29.6 10.7 17.5 13.7 ...
$ Office : num 21.3 21.5 22.1 27 18.2 35.4 25.3 34.2 26.1 17.7 ...
$ Construction: num 11.9 9.4 9.2 8.7 2.1 7.9 10.1 5.5 7.8 11 ...
$ Production : num 15.2 22 12.4 16.4 15.8 14.9 15.5 6.8 17.1 28.3 ...
$ Unemployment: num 5.4 13.3 6.2 10.8 4.2 10.9 11.4 8.2 8.7 7.2 ...
$ IncomePerCap: int 25713 18021 20689 24125 27526 30480 20442 32813 24028 24710 ...
Basic Linear Regression
We begin our predictive modeling with a simple linear regression model. This allows us to make basic statements about how the variables of interest correlate with income, but it will not be an appropriate final model, since it does not allow for interactions between components that we have already shown to be inter-related. We begin with an exhaustive search over our variables of interest.

The exhaustive search indicates, by adjusted R^2, that we should include all of our variables except Professional. Since we perceive few costs to expanding the model, we simply take the model with the highest adjusted value. Note that Professional is excluded not because it is individually unhelpful, but because the set of work types sums to one, so one of them will always be dropped. We also performed feature selection with BIC and Cp, with similar results; for brevity, they are not included.
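An exhaustive adjusted-R^2 search can be done in base R by fitting every predictor subset, as sketched below; the original analysis likely used a helper such as leaps::regsubsets(), and the simulated data (with signal deliberately placed on White and Service) is an assumption for self-containment.

```r
# Simulated data: income depends strongly on White and Service only
set.seed(7)
x <- as.data.frame(matrix(rnorm(300 * 4), ncol = 4))
names(x) <- c("White", "Service", "Production", "Unemployment")
x$IncomePerCap <- with(x, 100 * White - 500 * Service + rnorm(300, sd = 50))

# Enumerate every non-empty subset of the predictors
preds <- setdiff(names(x), "IncomePerCap")
subsets <- unlist(lapply(seq_along(preds), combn, x = preds,
                         simplify = FALSE), recursive = FALSE)

# Fit each subset and record its adjusted R^2
adjr2 <- sapply(subsets, function(vars) {
  f <- reformulate(vars, response = "IncomePerCap")
  summary(lm(f, data = x))$adj.r.squared
})

best <- subsets[[which.max(adjr2)]]   # predictor set with highest adjusted R^2
best
```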
Call:
lm(formula = IncomePerCap ~ . - Professional, data = datDisc)
Residuals:
Min 1Q Median 3Q Max
-57889 -3154 -136 3093 39362
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 51988.355 396.421 131.14 <2e-16 ***
Hispanic 41.671 3.749 11.11 <2e-16 ***
White 96.026 3.777 25.42 <2e-16 ***
Black 53.109 3.803 13.97 <2e-16 ***
Asian 128.931 4.808 26.82 <2e-16 ***
Service -536.921 3.174 -169.15 <2e-16 ***
Office -383.300 3.848 -99.62 <2e-16 ***
Construction -444.860 4.003 -111.12 <2e-16 ***
Production -533.878 3.054 -174.82 <2e-16 ***
Unemployment -270.607 4.495 -60.20 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 5519 on 69557 degrees of freedom
(105 observations deleted due to missingness)
Multiple R-squared: 0.7102, Adjusted R-squared: 0.7101
F-statistic: 1.894e+04 on 9 and 69557 DF, p-value: < 2.2e-16
Variance inflation factors:
Hispanic White Black Asian Service Office
17.508348 31.446591 16.187328 3.948750 1.475878 1.153010
Construction Production Unemployment
1.291156 1.195779 1.621923
Taking a simple linear regression, we find that we can predict the level of income with reasonably high effectiveness (R^2 = 0.71). We also fit a logistic regression on the discretized income variable (ipc4) using the same predictors; as shown below, it achieves an area under the ROC curve of 0.85.
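The logistic fit and AUC computation that follow can be sketched in miniature with numpy on synthetic data (the report itself uses R's glm with a binomial family and the pROC package; the gradient-ascent fitter and rank-based AUC below are illustrative stand-ins, not the report's code):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, lr=0.5, steps=3000):
    """Logistic regression via plain gradient ascent on the log-likelihood."""
    Xd = np.column_stack([np.ones(len(y)), X])
    beta = np.zeros(Xd.shape[1])
    for _ in range(steps):
        beta += lr * Xd.T @ (y - sigmoid(Xd @ beta)) / len(y)
    return beta

def auc(score, y):
    """Area under the ROC curve via the Mann-Whitney rank identity."""
    order = np.argsort(score)
    ranks = np.empty(len(score))
    ranks[order] = np.arange(1, len(score) + 1)
    n_pos, n_neg = y.sum(), len(y) - y.sum()
    return (ranks[y == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 3))
# binary outcome generated from a true logistic model on two of the columns
y = (rng.random(1000) < sigmoid(1.5 * X[:, 0] - X[:, 1])).astype(float)
beta = fit_logistic(X, y)
a = auc(np.column_stack([np.ones(1000), X]) @ beta, y)
print(round(a, 3))
```

An AUC of 0.5 corresponds to random guessing and 1.0 to perfect separation, which is the scale on which the 0.85 reported below should be read.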
Call:
glm(formula = ipc4 ~ . - Professional, family = "binomial", data = dat_ipc %>%
select(-ipc2))
Deviance Residuals:
Min 1Q Median 3Q Max
-5.0524 0.0011 0.1756 0.4267 3.7614
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 9.196189 0.239132 38.457 < 2e-16 ***
Hispanic 0.006449 0.002061 3.129 0.00176 **
White 0.034181 0.002088 16.371 < 2e-16 ***
Black 0.016145 0.002089 7.730 1.08e-14 ***
Asian 0.046923 0.002858 16.416 < 2e-16 ***
Service -0.158933 0.002283 -69.614 < 2e-16 ***
Office -0.092472 0.002650 -34.893 < 2e-16 ***
Construction -0.108381 0.002590 -41.851 < 2e-16 ***
Production -0.141521 0.002123 -66.665 < 2e-16 ***
Unemployment -0.145545 0.002835 -51.345 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 78240 on 69566 degrees of freedom
Residual deviance: 38592 on 69557 degrees of freedom
AIC: 38612
Number of Fisher Scoring iterations: 6
Variance inflation factors:
Hispanic White Black Asian Service Office
16.258238 25.125472 15.710318 2.872471 1.550756 1.503556
Construction Production Unemployment
1.443056 1.387141 1.127783
Area under the curve: 0.8501

Ridge Regression

The dataset has ten independent variables predicting the dependent variable, income per capita. Ridge regression was introduced because it minimizes the residual sum of squares plus a shrinkage penalty of lambda times the sum of the squared coefficients. As lambda increases, the coefficients approach zero. The plot shows the entire path of the variables as they shrink toward zero. To build the ridge regression, a log-scale grid of lambda values was constructed from 10^10 to 10^-2 in 100 segments.
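The lambda grid and shrinkage path described above can be reproduced with the closed-form ridge solution (an illustrative numpy sketch on synthetic data; the report itself fits ridge in R, presumably with glmnet):

```python
import numpy as np

def ridge_coefs(Xc, yc, lam):
    """Closed-form ridge solution on centered data (centering handles the intercept)."""
    p = Xc.shape[1]
    return np.linalg.solve(Xc.T @ Xc + lam * np.eye(p), Xc.T @ yc)

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 5))
y = X @ np.array([4.0, -3.0, 2.0, 0.0, 0.0]) + rng.normal(size=200)
Xc, yc = X - X.mean(0), y - y.mean()

# log-scale grid from 10^10 down to 10^-2 in 100 steps, as in the text
lambdas = np.logspace(10, -2, 100)
path = np.array([ridge_coefs(Xc, yc, lam) for lam in lambdas])

# a huge lambda crushes every coefficient toward zero; a tiny lambda recovers ~OLS
print(np.abs(path[0]).max(), np.abs(path[-1]).max())
```

Plotting each column of `path` against log(lambda) gives exactly the kind of coefficient-path figure the paragraph refers to.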
Train and Test sets
To avoid introducing bias in developing the ridge and lasso regressions, train and test data sets were introduced: the data were randomly split 50/50 into a training set and a test set.

lowest lambda from CV: 824.8974
MSE for best Ridge lambda: 12154666
All the coefficients:
(Intercept) Hispanic White Black Asian Professional
19143.7093 -300.1618 436.8876 -34.8267 267.4880 1313.3468
Service Office Construction Production Unemployment
-988.9918 -358.9968 -465.5783 -707.8713 -806.0590
R^2:
[1] 0.8843423
In order to obtain the best ridge model, cross-validation was implemented to find the best fit. The cross-validation curve indicates that a model with all ten predictors yields the lambda with the lowest mean squared error; as the lambda value decreases, the mean squared error also decreases. The best value of lambda is indicated by the first vertical dotted line. The lowest lambda from cross-validation was found to be 825, and the test MSE at that lambda was 12154666. From the fitted equation, the most positive coefficients were Professional at 1313, White at 437, and Asian at 267, while the strongest negative coefficients were Service at -989, Unemployment at -806, and Production at -708. It is interesting to note that Professional was the only work variable with a positive coefficient; all the others were negative. The R^2 value for the best ridge model was found to be 0.884, meaning that 88.4% of the variation in income can be explained by the model.
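The cross-validation step for choosing lambda can be sketched as follows (numpy, synthetic data; `ridge_fit` and `cv_mse` are illustrative helper names, not the report's code, which presumably relies on glmnet's built-in cross-validation):

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge coefficients for one lambda."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

def cv_mse(X, y, lam, k=10):
    """k-fold cross-validated mean squared error for one lambda."""
    folds = np.array_split(np.arange(len(y)), k)
    errs = []
    for hold in folds:
        train = np.setdiff1d(np.arange(len(y)), hold)
        beta = ridge_fit(X[train], y[train], lam)
        errs.append(((y[hold] - X[hold] @ beta) ** 2).mean())
    return float(np.mean(errs))

rng = np.random.default_rng(3)
X = rng.normal(size=(300, 5))
y = X @ np.array([2.0, -1.0, 0.5, 0.0, 0.0]) + rng.normal(size=300)

# evaluate a log-scale lambda grid and keep the lambda with the lowest CV error
lambdas = np.logspace(4, -2, 50)
scores = [cv_mse(X, y, lam) for lam in lambdas]
best_lam = float(lambdas[int(np.argmin(scores))])
print(best_lam, round(min(scores), 2))
```

The curve of `scores` against log(lambda) is the cross-validation line graph the paragraph describes, with the minimum marking the chosen lambda.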
Lasso Regression


lowest lambda from CV: 25.78395
MSE for best Lasso lambda: 11672982
All the coefficients:
(Intercept) Hispanic White Black Asian Professional
17484.872000 -212.822777 219.903152 0.000000 165.286025 1987.274255
Service Office Construction Production Unemployment
-284.732831 8.180852 0.000000 -4.656966 -619.976123
The non-zero coefficients:
(Intercept) Hispanic White Asian Professional Service
17484.872000 -212.822777 219.903152 165.286025 1987.274255 -284.732831
Office Production Unemployment
8.180852 -4.656966 -619.976123
R^2:
[1] 0.8889258
Lasso regression was also implemented to see whether this model would perform differently from the ordinary least squares (OLS) or ridge models. Lasso regression can be useful in reducing over-fitting and can assist in model selection, since it can shrink coefficients exactly to zero. From the coefficient path it can be seen that the three most positive coefficients are Professional at 1987, White at 220, and Asian at 165; these variables have a much stronger positive pull than the others. The three most negative coefficients are Unemployment at -620, Service at -285, and Production at -4.7. Black and Construction were shrunk to coefficients of exactly 0.0, so they drop out of the final lasso model. Cross-validation was used to select the lambda value with the lowest MSE, and it recommended eight predictors for the final model. The best lambda from cross-validation was found to be 25.8, and the test MSE for the best lasso model was 11672982. The R^2 value was found to be 0.889, meaning that 88.9% of the variation in income can be explained by the model.
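Lasso's exact zeros come from the soft-thresholding step in coordinate descent, sketched here on synthetic data (an illustrative numpy implementation, not the report's glmnet fit; `lasso_cd` and the data are our own):

```python
import numpy as np

def lasso_cd(X, y, lam, sweeps=200):
    """Lasso via cyclic coordinate descent; the soft-threshold sets a coefficient
    exactly to zero when its predictor's signal falls below the penalty."""
    n, p = X.shape
    beta = np.zeros(p)
    col_ss = (X ** 2).sum(axis=0)
    for _ in range(sweeps):
        for j in range(p):
            r = y - X @ beta + X[:, j] * beta[j]   # partial residual excluding j
            rho = X[:, j] @ r
            beta[j] = np.sign(rho) * max(abs(rho) - n * lam, 0.0) / col_ss[j]
    return beta

rng = np.random.default_rng(4)
X = rng.normal(size=(200, 5))
# only columns 0 and 2 carry signal; lasso should zero out the rest
y = X @ np.array([3.0, 0.0, -2.0, 0.0, 0.0]) + rng.normal(size=200)

beta = lasso_cd(X - X.mean(0), y - y.mean(), lam=0.5)
print(np.round(beta, 2))
```

This is the mechanism behind Black and Construction dropping out of the fitted model above: their contributions fall below the penalty threshold, so their coefficients land at exactly zero rather than merely close to it.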
Call:
lm(formula = IncomePerCap ~ ., data = datJLClean)
Residuals:
Min 1Q Median 3Q Max
-27748.0 -1894.5 -20.2 1770.9 18988.4
Coefficients: (1 not defined because of singularities)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 17391.74 36.71 473.707 < 2e-16 ***
Hispanic 104.54 54.22 1.928 0.0539 .
White 669.66 72.93 9.183 < 2e-16 ***
Black 334.20 52.11 6.414 1.43e-10 ***
Asian 326.28 25.82 12.635 < 2e-16 ***
Professional 1077.33 2706.77 0.398 0.6906
Service -814.52 1609.32 -0.506 0.6128
Office -358.87 1173.52 -0.306 0.7598
Construction -378.99 1193.57 -0.318 0.7508
Production -515.94 1505.74 -0.343 0.7319
Unemployment -616.00 17.07 -36.087 < 2e-16 ***
ipc2High 19892.01 62.25 319.559 < 2e-16 ***
ipc4Mid-Low 5181.26 42.82 120.998 < 2e-16 ***
ipc4Mid-High -9860.07 41.82 -235.757 < 2e-16 ***
ipc4High NA NA NA NA
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 3412 on 69553 degrees of freedom
Multiple R-squared: 0.8893, Adjusted R-squared: 0.8892
F-statistic: 4.296e+04 on 13 and 69553 DF, p-value: < 2.2e-16
Call:
lm(formula = IncomePerCap ~ Hispanic + White + Black + Asian +
Professional + Service + Office + Production + Unemployment,
data = datJLClean)
Residuals:
Min 1Q Median 3Q Max
-57889 -3155 -139 3092 39315
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 26167.82 20.93 1250.46 <2e-16 ***
Hispanic 972.98 87.56 11.11 <2e-16 ***
White 2983.19 117.35 25.42 <2e-16 ***
Black 1175.89 84.20 13.97 <2e-16 ***
Asian 1115.17 41.58 26.82 <2e-16 ***
Professional 5991.86 53.93 111.11 <2e-16 ***
Service -737.90 39.77 -18.55 <2e-16 ***
Office 359.14 28.63 12.54 <2e-16 ***
Production -667.56 40.62 -16.44 <2e-16 ***
Unemployment -1604.14 26.65 -60.19 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 5519 on 69557 degrees of freedom
Multiple R-squared: 0.7102, Adjusted R-squared: 0.7101
F-statistic: 1.894e+04 on 9 and 69557 DF, p-value: < 2.2e-16
MSE for full model:
[1] 11638291
MSE for full model (w/o construction):
[1] 30460435
An OLS model was constructed for both the full model and the model without the construction variable, to compare them with the ridge and lasso models. The R^2 for the full model was found to be 0.889, while the model without Construction has an R^2 of 0.710, meaning it explains 71.0% of the variation in income. Furthermore, the MSE for the full model was found to be 11638291, while the model without Construction had a much larger MSE at 30460435. Overall the lasso, ridge, and full OLS models explain roughly the same amount of the variability in the data, with R^2 values clustered around 0.88-0.89. Since the full OLS model has the lowest MSE and the highest R^2, it would be a more suitable option than the ridge, lasso, or the OLS without Construction.
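The MSE and R^2 comparison between a full and a reduced OLS model can be sketched as follows (numpy, synthetic data; dropping the last column stands in for dropping Construction, and `ols_metrics` is our own helper, not the report's code):

```python
import numpy as np

def ols_metrics(X, y):
    """Fit OLS with an intercept; return in-sample MSE and R^2."""
    Xd = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    resid = y - Xd @ beta
    mse = float((resid ** 2).mean())
    r2 = float(1 - (resid @ resid) / ((y - y.mean()) ** 2).sum())
    return mse, r2

rng = np.random.default_rng(5)
X = rng.normal(size=(400, 4))
y = X @ np.array([2.0, -1.0, 0.5, 0.3]) + rng.normal(size=400)

full_mse, full_r2 = ols_metrics(X, y)          # all predictors
drop_mse, drop_r2 = ols_metrics(X[:, :3], y)   # one predictor dropped
print(round(full_r2, 3), round(drop_r2, 3))
```

In-sample, the full model can never do worse than its nested submodel, which is why a dropped predictor with real signal shows up as a higher MSE and lower R^2, as in the Construction comparison above.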
Conclusion
Overall, this analysis found that there are several ways in which our independent variables reliably predict income in communities across the United States. Ethnicity and work-type proportions both had strong predictive power, with the latter having the most powerful effects. However, these variables suffer from being largely non-normal, with a rightward skew, and from having high internal correlations, both between and within the two categories. Altogether, these variables allow us to predict income per capita at the census tract level with high reliability (R^2 = 0.71); this is actually quite impressive given the simplicity of the data. For instance, it does not directly include any information about the age or education of the population.
Lasso and ridge regression were introduced to see whether other forms of regression would be more effective than the ordinary least squares models. Cross-validation was used for both the lasso and ridge regressions to find the combination of coefficients with the lowest MSE. Although the models cannot be directly compared, as each uses a different penalizing factor, the R^2 value was used to compare how much of the variation in income each model explains. The R^2 values for the ridge and lasso regressions were 0.884 and 0.889, respectively, while the full OLS model had a slightly higher R^2 at 0.8893. All of the models explain roughly the same amount of the variation; since the OLS value is slightly higher, it would be the recommended choice.
Clustering:
K-means was utilized to see how the data group together, so that clusters with different income per capita might show distinct ethnicity or work-type cluster means. Clusters were constructed for k = 2, 3, 4, and 10. Overall, higher-income clusters appear to have higher averages for Professional. As more clusters were added, a greater percentage of the variation in the data was explained: 3 clusters explained 83.3%, 4 clusters explained 90.2%, and 10 clusters explained 98.2%. Four clusters became the optimal choice, since the gain from 3 to 4 was substantial while the gain from 4 to 10 was modest.
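The clustering step can be sketched with a plain Lloyd's-algorithm implementation that also reports the variance-explained figure used above (numpy, synthetic two-group data standing in for tract profiles; the report itself uses R's kmeans):

```python
import numpy as np

def kmeans(X, k, iters=50, seed=0):
    """Plain Lloyd's algorithm; returns labels and the fraction of variance
    explained (1 - within-cluster SS / total SS)."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = dists.argmin(axis=1)
        centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    within = sum(((X[labels == j] - centers[j]) ** 2).sum() for j in range(k))
    total = ((X - X.mean(axis=0)) ** 2).sum()
    return labels, 1 - within / total

rng = np.random.default_rng(6)
# two well-separated synthetic groups
X = np.vstack([rng.normal(0, 1, size=(100, 2)), rng.normal(8, 1, size=(100, 2))])
labels, explained = kmeans(X, k=2)
print(round(explained, 3))
```

The "percentage of variation explained" quoted for k = 3, 4, and 10 is exactly this between/total sum-of-squares ratio, which always rises as k grows; the elbow at k = 4 is what justifies stopping there.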
Future Work:
Moving forward, this analysis allows for several expansions. The first is to integrate new data, such as the age and education status of census tract residents. Additionally, it may be valuable to consider each of the individual freedom measures on its own, to negate the influence of high internal correlation. Finally, it would be interesting to see whether there are differences driven by geographic density, which can be estimated with the currently accessible data.
Bibliography
MuonNeutrino. (2015). US Census Demographic Data: Demographic and Economic Data for Tracts and Counties. Kaggle. Retrieved March 23, 2020, from https://www.kaggle.com/muonneutrino/us-census-demographic-data
U.S. Census Bureau (2019). “Annual Estimates of the Resident Population for the United States, Regions, States, and Puerto Rico: April 1, 2010 to July 1, 2019”. 2010-2019 Population Estimates. United States Census Bureau, Population Division. December 30, 2019. Retrieved January 27, 2020.
U.S. Census Bureau (2017). “American FactFinder - Results”. U.S. Census Bureau. Retrieved 2017-12-13.
U.S. Census Bureau (2013). “2010 Census Summary File 1: GEOGRAPHIC IDENTIFIERS”. American Factfinder. US Census. Retrieved 18 October 2013.